Cluster analysis is an exploratory multivariate technique that
continues to have wide application in studies of geographic
variation and other kinds of intraspecific population structure.
It can be especially valuable when used in conjunction with
ordination techniques, such as multidimensional scaling and
eigenvector-based methods, and with other kinds of tree-building
procedures (Kruskal, 1977; Pruzansky et al., 1982; Lessa, 1990).

During a recent study of geographic variation in the rodent
Peromyscus spicilegus (Bradley et al., 1996), a peculiar effect of
sample-size variation on UPGMA dendrograms was noted, as described
below. In hindsight, the effect is an obvious result of sampling
theory and can easily be demonstrated by simulation. However,
because the effect is likely to arise with other kinds of data and
tree-building algorithms (Mooers et al., 1995) and has not been
investigated seriously in the literature, it is worth reporting in
some detail. It is a topic that was intimated by Sneath and Sokal
(1973) but has gone unmentioned in methodological texts and
monographs in systematics and ecology (Pimentel, 1979; Gauch, Jr.,
1982; Digby and Kempton, 1987; Ludwig and Reynolds, 1988). It also
has been lightly regarded in the wider literature on hierarchical
clustering (Van Ryzin, 1977; Milligan and Cooper, 1987; Bock, 1988;
Jain and Dubes, 1988; Diday et al., 1988; Fukunaga, 1990; Kaufman
and Rousseeuw, 1990; Everitt, 1993; Diday et al., 1994; Arabie and
Hubert, 1995; Arabie et al., 1996), in which data to be clustered
generally are taken at face value, although probabilistic aspects
are often considered in partition clustering methods (Bock, 1989;
Flury, 1995; Puzicha et al., 2000). The goal of our study is to
demonstrate that small sample sizes may have an adverse effect on
the results of cluster analysis. We have used empirical data
obtained from a previous study (Bradley et al., 1989) and computer
simulations to illustrate the potential pitfalls.








Occasional Papers, Museum of Texas Tech University 


An Example 


A set of morphometric measurements, 18 cranial and 5 external, was
taken on 356 adult individuals of Peromyscus spicilegus from
western Mexico. These data were taken from a previous study
(Bradley et al., 1989) that examined the systematic status of P.
spicilegus. All measurements were taken by a single individual
(R. D. Bradley); measurements and localities are detailed in
Bradley et al. (1989). Samples were small for some localities, and
a small number of missing values (1.6% of the total data set) were
estimated using the iterative maximum-likelihood
expectation-maximization (EM) method as described by Little and
Rubin (1987) and Strauss and Atanassov (in press). Euclidean
distances were then calculated among locality samples based on
character means of standardized data (mean = 0 and standard
deviation = 1), as in Bradley et al. (1989). The resulting UPGMA
dendrogram (Fig. 1) displays a rather striking pattern: samples
having small sample sizes (1-6, in this case) tend to “chain”
individually to the outside of the dendrogram. In addition, within
well-defined clusters, the samples having the smallest sample sizes
often, but not always, chain singly or in pairs to the outside of
the clusters. When samples with few observations are omitted and
the dendrogram is reconstructed, other small-sized samples that
previously had been embedded within clusters sometimes take their
place by chaining to the outside.
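
The distance calculation described above can be sketched as follows. This is a minimal illustration rather than the original analysis: the toy measurements, the locality labels, and the helper names (`standardize`, `locality_means`, `euclidean`) are hypothetical, and the use of the sample standard deviation (n - 1 denominator) for standardization is an assumption.

```python
import math
from statistics import mean, stdev

def standardize(rows):
    """Rescale every character (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]  # assumption: n - 1 denominator
    return [[(x - c) / s for x, c, s in zip(row, centers, spreads)] for row in rows]

def locality_means(rows, localities):
    """Average the standardized characters within each locality sample."""
    grouped = {}
    for row, loc in zip(rows, localities):
        grouped.setdefault(loc, []).append(row)
    return {loc: [mean(col) for col in zip(*members)]
            for loc, members in grouped.items()}

def euclidean(a, b):
    """Euclidean distance between two vectors of character means."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data: 6 individuals, 2 characters, 3 localities of unequal size.
measurements = [[10.0, 4.0], [12.0, 5.0], [11.0, 6.0],
                [15.0, 7.0], [14.0, 9.0], [9.0, 3.0]]
localities = ["A", "A", "A", "B", "B", "C"]

z = standardize(measurements)
means = locality_means(z, localities)
distances = {(p, q): euclidean(means[p], means[q])
             for p in means for q in means if p < q}
```

Note that locality C is represented by a single individual, so its "mean" is simply that individual's standardized measurements. This is the situation in which sampling error is largest.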

In the Peromyscus study (Bradley et al., 1989), an analogous effect
was noted in principal-component analyses of the covariance matrix
of sample means, in which sample means rather than individuals were
used as observations to avoid biasing the results toward localities
represented by large sample sizes. In the resulting scatterplots of
principal-component scores (Fig. 2), small samples tend to lie at
the periphery of clusters and therefore tend to control the spatial
positions and orientations of the eigenvectors.

The effects of small sample sizes are not only particularly
striking in this example but can also be recognized in more subtle
variations in published analyses (e.g., Strauss, 1980; Bradley et
al., 1989; Carleton and Musser, 1995; Gay and Best, 1995;
Reichling, 1995; Teixeira and Musick, 1995; Arroyo-Cabrales and
Owen, 1996; Engle and Summers, 1999). In these examples, biological
explanations usually were proposed to account for the observed
patterns. It is perhaps not surprising that small samples should be
subject to significant sampling variation and should fall out as
“outliers” on dendrograms. However, we demonstrate by simulation
that small samples are capable of affecting the clustering patterns
of larger samples as well as the overall tree topology.

Figure 1. UPGMA dendrogram based on Euclidean distances among 32
geographically distributed samples of Peromyscus spicilegus.
Distances are based on means of 23 morphometric characters. Numeric
labels identify localities; sample sizes are shown in parentheses.
Data are from Bradley et al. (1996). Axis: Euclidean distance.

Figure 2. Scattergram of scores of samples of Peromyscus spicilegus
on the first two principal components of the covariance matrix of
sample means. + = samples with >10 observations; o = samples with
<10 observations.


Simulations 

Clustering in the Absence of Group Structure 

The sample-size effect is most likely to occur in the absence of
any other structure in the data, biological or otherwise. To
simulate this, we assigned states for five characters and 10
samples by drawing character states randomly from a single,
arbitrarily chosen normal distribution, $N(\mu = 10, \sigma = 2)$.
Sample sizes for the 10 samples were set at
$n_i = \{1, 1, 2, 2, 5, 5, 50, 50, 60, 60\}$, and the values for
each character were averaged across individuals. This and all other
analyses were programmed in Matlab v4.2c (Mathworks, 1997). Typical
resulting UPGMA dendrograms of Euclidean distances among sample
means are illustrated in Figure 3. There is an obvious and very
strong effect of sample size on the clustering structure, with
samples having large $n_i$'s clustering together and samples with
decreasing $n_i$'s chaining consecutively (not always in strict
sample-size sequence) to the main cluster. Repeated simulations
varying the numbers of samples, characters per sample, and
sample-size distributions display this same effect to varying
degrees.
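
The null simulation described above can be sketched in a few lines. This is an illustrative Python re-implementation, not the original Matlab code: the function names, the seed, the number of replicates, and the summary statistic (the height at which each sample first joins another cluster) are all assumptions introduced here. UPGMA is implemented directly as average linkage over all leaf pairs.

```python
import math
import random
from statistics import mean

def sample_means(sizes, n_chars, mu=10.0, sigma=2.0, rng=random):
    """Mean of each character for each sample; every draw comes from N(mu, sigma)."""
    return [[mean([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(n_chars)]
            for n in sizes]

def upgma(points):
    """Average-linkage clustering; returns merges as (cluster_a, cluster_b, height)."""
    clusters = [(i,) for i in range(len(points))]

    def avg_dist(a, b):
        # UPGMA distance: unweighted mean over all leaf pairs between two clusters.
        return mean([math.dist(points[i], points[j]) for i in a for j in b])

    merges = []
    while len(clusters) > 1:
        height, a, b = min((avg_dist(a, b), a, b)
                           for k, a in enumerate(clusters) for b in clusters[k + 1:])
        clusters = [c for c in clusters if c != a and c != b] + [a + b]
        merges.append((a, b, height))
    return merges

def join_heights(merges, n_leaves):
    """Height at which each original sample first joins another cluster."""
    heights = [0.0] * n_leaves
    for a, b, height in merges:
        for side in (a, b):
            if len(side) == 1:
                heights[side[0]] = height
    return heights

sizes = [1, 1, 2, 2, 5, 5, 50, 50, 60, 60]
rng = random.Random(1)  # arbitrary seed, for repeatability only
n_reps = 100            # arbitrary number of replicates
totals = [0.0] * len(sizes)
for _ in range(n_reps):
    heights = join_heights(upgma(sample_means(sizes, 5, rng=rng)), len(sizes))
    totals = [t + h for t, h in zip(totals, heights)]
mean_join = [t / n_reps for t in totals]
for n, h in zip(sizes, mean_join):
    print(f"n = {n:2d}  mean joining height = {h:.2f}")
```

Although every observation is drawn from the same distribution, the smallest samples should join the tree at much greater average heights than the largest ones, which is the chaining pattern described in the text.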




Figure 3. Two replicates (a and b) of randomized cluster analyses
of Euclidean distances based on means of five characters for 10
samples. Sample means were calculated from observations drawn from
the same normal distribution, $N(\mu = 10, \sigma = 2)$. Sample
sizes are shown in parentheses.

Clustering in the Presence of Group Structure

To simulate biological structure, the 10 samples were divided into
two groups (perhaps representing two geographical or ecological
domains), the character states from each group being sampled from
normal distributions identical in variance but differing by
$1.5\sigma$ in their means: $N(10, 2)$ and $N(13, 2)$. Sample sizes
for the five samples within each group were set at
$n_i = \{1, 2, 5, 50, 60\}$ and, as before, UPGMA dendrograms were
computed from Euclidean distances among sample means. It is to be
expected that the dendrograms should display two main clusters
(Fig. 4) because of the structure built into the sampling regime.
However, because the samples within each of the clusters differ
only randomly, there is an observable effect of sample-size
variation within clusters: the samples having the largest sample
sizes group together, with successively smaller samples chaining
consecutively. Because character variation is random in this model,
the smaller samples occasionally are sufficiently similar to form
secondary clusters (Fig. 4a).

The presence of group structure was extended by considering 15
samples in three groups, with group means assigned randomly from a
uniform distribution over the interval [10, 15]. Character states
for five characters were selected from normal distributions having
a constant $\sigma$ and centered on the group means. Two typical
examples are shown in Figure 5. In the first example (Fig. 5a),
groups A ($\bar{X}_A = 12.6$) and B ($\bar{X}_B = 14.2$) are
slightly more similar to one another than either is to group C
($\bar{X}_C = 10.2$), and the dendrogram reflects this by
clustering the A and B samples that have the largest sample sizes.
Some of the smaller A and B samples then chain to this cluster. The
two largest C samples also cluster together, followed by two of the
smaller C samples. These clusters join and the smallest samples in
the data set then chain to the outside.

In the second example (Fig. 5b), groups A ($\bar{X}_A = 14.7$) and
B ($\bar{X}_B = 14.2$) are very different from group C
($\bar{X}_C = 12.6$). The largest samples of A and B cluster
together first, and the smaller A and B samples chain consecutively
to this central core. Group C comprises its own cluster, with the
two largest samples joining first. This basic pattern is disturbed
only by two small samples of B and C, which join the opposing
clusters as outliers.

Figure 4. Two replicates (a and b) of randomized cluster analyses
of Euclidean distances based on means of five characters for two
groups of five samples each. Within each

Systematic Changes in Tree Structure due to 
Sample-size Effects 

Figure 5. Two replicates (a and b) of randomized cluster analyses
of Euclidean distances based on means of five characters for three
groups of five samples each. Within each group samples were drawn
from the same normal distribution, $N(\bar{X}_i, 2)$, where the
three group means $\bar{X}_i$ were drawn from the uniform
distribution $U(10, 15)$. Sample sizes are shown in parentheses.
(a) $\bar{X}_A = 12.6$, $\bar{X}_B = 14.2$, $\bar{X}_C = 10.2$.
(b) $\bar{X}_A = 14.7$, $\bar{X}_B = 14.2$, $\bar{X}_C = 12.6$.

These simple examples illustrate the two effects of small samples:
the alteration of similarity relationships, and the tendency to
increase the degree of asymmetry or imbalance of the dendrogram by
consecutively chaining small samples to clusters formed by larger
ones. The magnitudes of these effects, and their dependence on
number of characters and varying degrees of relationship among
samples, were studied with the following simulation design. “True”
mean character values for each of 10 samples ($\mu_i$,
$i = 1, \ldots, 10$) were assigned randomly from a uniform
distribution over the interval [10, 20]; the resulting group
differences served as proxies for real biological differences among
taxa. Because the sampling error of the mean is inversely
proportional to $\sqrt{n}$, a distribution of sample sizes among
samples was created by sampling $\sqrt{n_i}$, $i = 1, \ldots, 10$,
from a uniform distribution $U[0, \sqrt{50}]$, squaring the
resulting values and rounding them upward to the nearest integers,
which thus varied from 1 to 50 according to a square-root
distribution. Heterogeneity among sample sizes was measured as
$\mathrm{var}(\sqrt{n_i})$. Mean character-state values $\bar{x}_i$
for each sample were then estimated by sampling (with the
corresponding sample size) each character from a normal
distribution $N(\mu_i, \sigma = 2)$ and averaging across
observations; the same distribution was used for all characters of
each sample.
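
The sampling design in the preceding paragraph can be sketched as follows. The function names, the seed, and the clamping of sizes to the range 1 to 50 (a guard against floating-point round-off at the endpoints) are assumptions of this sketch, as is the use of the sample variance (n - 1 denominator) for the heterogeneity measure.

```python
import math
import random
from statistics import mean, variance

def draw_sample_sizes(k=10, n_max=50, rng=random):
    """Draw sqrt(n) ~ U[0, sqrt(n_max)], square it, and round up to an integer."""
    sizes = []
    for _ in range(k):
        root = rng.uniform(0.0, math.sqrt(n_max))
        # Clamp to [1, n_max]: guards the endpoints against floating-point round-off.
        sizes.append(min(n_max, max(1, math.ceil(root ** 2))))
    return sizes

def size_heterogeneity(sizes):
    """Heterogeneity of sample sizes, measured as var(sqrt(n_i))."""
    return variance([math.sqrt(n) for n in sizes])

def estimated_means(true_means, sizes, n_chars, sigma=2.0, rng=random):
    """Estimate each sample's character means from n_i draws of N(mu_i, sigma)."""
    return [[mean([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(n_chars)]
            for mu, n in zip(true_means, sizes)]

rng = random.Random(7)  # arbitrary seed, for repeatability only
true_means = [rng.uniform(10.0, 20.0) for _ in range(10)]  # "true" mu_i
sizes = draw_sample_sizes(rng=rng)
x_bar = estimated_means(true_means, sizes, n_chars=5, rng=rng)
heterogeneity = size_heterogeneity(sizes)
```

Each sample uses one true mean for all of its characters, as stated in the text, so the true mean vector of sample i is simply `true_means[i]` repeated across characters.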

From these values two UPGMA dendrograms were produced, $T_t$ based
on the “true” means (which, of course, are unknown in actual
studies) and $T_e$ on the estimated means (Fig. 6). The differences
between these trees must be solely a function of variation in
sample size among samples and resulting variation in estimated
character means. Two measures were used to describe the differences
between corresponding trees that were anticipated based on the
preceding observations and simulations. The expected increase in
tree asymmetry due to the sample-size effect was measured using the
standardized version of Colless’s $I$ (Colless, 1980, 1995; Shao
and Sokal, 1990), a measure of tree asymmetry. The index was
computed separately for the two dendrograms ($I_t$ for the “true”
tree and $I_e$ for that based on estimates) and the difference
expressed as $\Delta I = I_e - I_t$. The degree of concordance in
topology between corresponding trees was measured using the
“agreement metric” $d(T_t, T_e)$ of Goddard et al. (1994),
standardized to vary between 0 and 1 for 10 samples.
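
The asymmetry measure can be illustrated with a short sketch. The text does not give the standardization formula, so this example assumes the common convention of dividing Colless's sum by its maximum for n leaves, (n - 1)(n - 2)/2. The nested-tuple tree representation, the function names, and the two four-sample trees are hypothetical. The agreement metric of Goddard et al. (1994) is not reproduced here.

```python
def leaf_count(tree):
    """Number of leaves below a node; a leaf is anything that is not a tuple."""
    if not isinstance(tree, tuple):
        return 1
    left, right = tree
    return leaf_count(left) + leaf_count(right)

def colless_sum(tree):
    """Sum over internal nodes of |leaves on the left - leaves on the right|."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    imbalance = abs(leaf_count(left) - leaf_count(right))
    return imbalance + colless_sum(left) + colless_sum(right)

def colless_index(tree):
    """Colless's I scaled to [0, 1] by its maximum, (n - 1)(n - 2) / 2 (assumed)."""
    n = leaf_count(tree)
    if n < 3:
        return 0.0
    return colless_sum(tree) / ((n - 1) * (n - 2) / 2)

# Hypothetical four-sample trees written as nested pairs.
balanced = (("a", "b"), ("c", "d"))   # stands in for the "true" tree
chained = ((("a", "b"), "c"), "d")    # stands in for the estimated tree

i_true = colless_index(balanced)      # 0.0: perfectly symmetric
i_est = colless_index(chained)        # 1.0: fully chained
delta_i = i_est - i_true              # 1.0
```

Under this scaling a fully chained (pectinate) tree scores 1 and a perfectly balanced tree scores 0, so a positive difference indicates that the estimated tree is more asymmetric than the "true" one.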

The procedure of assigning and estimating sample means and of
producing and comparing the resulting dendrograms was repeated 500
times for a given number of characters, which was varied from 5 to
30 in increments of 5. One set of results, those based on 25
characters, is shown in Figure 7, and all results are summarized in
Table 1. The patterns are striking: as the number of characters
increases the sample-size effect becomes greater, both by
increasing the degree of asymmetry of the tree and by increasing
the difference between the tree based on true means and that based
on estimated means. For approximately 15 or more characters, highly
significant correlations with the amount of variability among
sample sizes are evident, both for asymmetry increase ($\Delta I$)
and for topology incongruence ($d$). As the variance among sample
sizes increases, the amount of tree asymmetry tends to increase
because of the chaining of small samples, while the level of
topological agreement with the “true” tree decreases, primarily due
to the chaining of small samples but also due to rearrangements
within the main clusters.

These simulations also predict the effect of estimating sample
means when all sample sizes are equal, in which case
$\mathrm{var}(\sqrt{n_i}) = 0$ and the expected sample size is
about 13 (12.5). In Fig. 7a, for example, the predicted increase in
asymmetry for a balanced design is not significantly different from
zero, indicating that the sampling error introduced by estimating
group means does not, by itself, lead to a chaining effect.
However, a dendrogram produced from estimated means is expected to
differ somewhat in topology from that produced from true means, and
Fig. 7b suggests that the expected level of topological agreement
between the two trees is only about $d = 0.75$ even in the absence
of sample-size variability, and degrades from there. The value
$d = 0.75$ is significantly different from 1.0 at $p < 0.001$.


Discussion 


Figure 6. Randomized cluster analysis of Euclidean distances based
on means of five characters for 10 samples. (a) Dendrogram based on
“true” means $\mu_i$, drawn independently for the 10 samples from
the uniform distribution $U(10, 20)$. Sample sizes, although
indicated in parentheses, are irrelevant. (b) Corresponding
dendrogram based on estimated means, drawn separately for each
sample from a normal distribution $N(\mu_i, \sigma = 2)$, centered
on the sample's true mean. Sample sizes are shown in parentheses.


The sample-size effect described here is a result of the well-known
Central Limit Theorem, which states that the sampling distribution
of any statistic that can be expressed as a function of the sum of
$n$ independent observations (which includes the mean as a special
case) is asymptotically normal, with an expected standard error of
$s_{\bar{x}} = s/\sqrt{n}$, where $s$ is an estimate of the
standard deviation of the population. Thus for small samples the
sampling variation of the mean can be relatively wide, producing
imprecise (though accurate) estimates. An example is shown
graphically in Figure 8, which portrays a random sample of
observations from a bivariate normal distribution centered on the
point (0, 0) (Fig. 8a), and a corresponding collection of estimates
of this true (parametric) mean vector (Fig. 8b). The pattern is
obvious: means of small samples typically lie relatively far from
the population value they represent, while means of large samples
tend to be closer to this value. This pattern extrapolates to three
or more dimensions, projections of which tend to look like the
principal-component space of Figure 2.
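
As a worked illustration of the standard-error formula, take the population standard deviation used in the simulations, $\sigma = 2$, in place of the estimate $s$ (an assumption made here for arithmetic convenience), and the smallest, an intermediate, and the largest of the sample sizes used in the first simulation:

```latex
\[
  s_{\bar{x}} = \frac{s}{\sqrt{n}}, \qquad \sigma = 2:\quad
  n = 1 \;\Rightarrow\; 2.00, \qquad
  n = 5 \;\Rightarrow\; \frac{2}{\sqrt{5}} \approx 0.89, \qquad
  n = 60 \;\Rightarrow\; \frac{2}{\sqrt{60}} \approx 0.26 .
\]
```

On each character, the mean of a single-observation sample is therefore expected to vary about $\sqrt{60} \approx 7.7$ times as widely as the mean of a 60-observation sample.
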
Table 1. Asymmetry difference ($\Delta I$) and tree concordance
($d$) between pairs of “true” and estimated trees, as a function of
number of characters. Each row of the table is based on 500
replications.

                         $\Delta I$                        $d$
No. characters   Mean ± 2SE    Correlation(a)   Mean ± 2SE    Correlation(a)
       5         0.12 ± 0.04       0.04         0.86 ± 0.02       -0.14
      10         0.16 ± 0.07       0.07         0.83 ± 0.03       -0.09
      15         0.19 ± 0.06       0.26         0.72 ± 0.07       -0.57
      20         0.28 ± 0.03       0.42         0.68 ± 0.08       -0.66
      25         0.30 ± 0.06       0.53         0.56 ± 0.06       -0.51
      30         0.37 ± 0.04       0.51         0.39 ± 0.04       -0.26

(a) Correlation with $\mathrm{var}(\sqrt{n_i})$.


Similar biases may be observed for discrete polymorphic characters
as well as continuously varying ones, and for other kinds of
tree-building methods. For example, Mooers et al. (1995) recently
found that “phylogenetic noise” (variation in the states of
uninformative, i.e., noisy, cladistic characters) increases the
expected magnitude of tree imbalance for cladograms based on
maximum parsimony. Thus sampling variation comes in different
flavors, but might have similar effects across a wide range of
procedures.

Because trees based on large sets of characters show
proportionately less uncertainty than trees based on few characters
when considering variation among characters (Felsenstein, 1985;
Sneath, 1986), it might seem counterintuitive that the sample-size
effect (which is due to variation within characters) is likely to
increase rather than decrease with a larger number of characters.
However, the more characters that are sampled, the more likely it
is that a small sample will have an outlying mean for at least one
of them. In Figure 8b, for example, the single-observation sample
at the “bottom” of the scatter has a value very near the true mean
for character $X_1$ but is an outlier on $X_2$. Both of the
five-observation samples have means close to the true mean of $X_2$
but are relative outliers in different directions on $X_1$.
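
One way to quantify this argument, offered here as a related illustration rather than as part of the original analysis, is to note that for $p$ independent characters with common standard deviation $\sigma$, the root-mean-square Euclidean distance of a sample mean vector from the true mean is $\sigma\sqrt{p/n}$. The absolute gap between small and large samples therefore grows with $\sqrt{p}$. The function names, the seed, and the Monte Carlo check below are assumptions of this sketch.

```python
import math
import random
from statistics import mean

def rms_distance(n_chars, n, sigma=2.0):
    """Root-mean-square distance of a sample mean vector from the true mean."""
    return sigma * math.sqrt(n_chars / n)

def simulated_rms(n_chars, n, sigma=2.0, reps=4000, rng=random):
    """Monte Carlo check of rms_distance, with the true mean placed at the origin."""
    squared = []
    for _ in range(reps):
        x_bar = [mean([rng.gauss(0.0, sigma) for _ in range(n)])
                 for _ in range(n_chars)]
        squared.append(sum(v * v for v in x_bar))
    return math.sqrt(mean(squared))

rng = random.Random(3)  # arbitrary seed, for repeatability only
for p in (5, 30):
    gap = rms_distance(p, 1) - rms_distance(p, 60)
    print(f"p = {p:2d}: n=1 -> {rms_distance(p, 1):.2f}, "
          f"n=60 -> {rms_distance(p, 60):.2f}, gap = {gap:.2f}")
check = simulated_rms(5, 5, rng=rng)  # should be close to rms_distance(5, 5) = 2
```

With $\sigma = 2$, the gap between a single-observation sample and a 60-observation sample is about 3.89 distance units for 5 characters and about 9.54 for 30 characters, so small samples sit progressively farther outside the main cluster as characters are added.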

This effect of character number exists whether or not characters
are correlated, although such correlations do reduce the effective
number of independent sources of information in the data set. If
characters are correlated, circular multivariate distributions
become elliptical and it is appropriate to use Mahalanobis rather
than Euclidean distances as measures of dissimilarity. However, the
principles of sampling underlying the sample-size effect hold for
any form of distribution.

These effects are most likely to be encountered in studies of
geographic variation of larger animals and plants, because in such
studies sizes of locality samples are typically small and dispersed
and biological patterns may be weak. However, even in the presence
of strong clustering, some secondary effect is likely to be
noticed. The overall effect is a tug-of-war between biological
signal and statistical artifact. The result depends on the relative
strengths of these components in various parts of the tree.

The increase in tree asymmetry is likely to be mitigated by the
tendency of UPGMA dendrograms to display only moderate asymmetry
(Lance and Williams, 1966). However, this effect must then be
compensated for by the alteration of similarity relationships
within the tree. This is likely the reason that, in the
simulations, the effect of decreasing correlation with “true”
structure was greater than the effect of increasing tree asymmetry,
although the difference in effects is also undoubtedly a result of
differences in sensitivity of the two kinds of tree-comparison
metrics. We have yet to study the magnitude of the sample-size
effect on the “flexible” clustering algorithm (Lance and Williams,
1966; Milligan, 1989; Belbin et al., 1992), in which it is possible
to explicitly control the degree of tree asymmetry by varying the
parameter $\beta$.

Figure 7. Dependence of (a) asymmetry increase due to sampling
effect, $\Delta I$, and (b) concordance with the true tree, $d$,
for the case of 25 characters.

Figure 8. A random sample of (a) 200 observations drawn from a
bivariate normal distribution with zero means and unit variances
for the two variables $X_1$ and $X_2$ and (b) means from the same
distribution (corresponding sample sizes are indicated beside each
point).


Cluster analyses are designed to produce clusters whether they
exist or not. Therefore, a major weakness of cluster analysis as an
analytic method is the fact that, except in special cases, it is
difficult to assess the statistical “significance” of the results
(Sokal, 1977; Sneath, 1979; Milligan, 1981; Bock, 1985;
Breckenridge, 1989). Procedures are available for assessing the
significance of the overall clustering structure (Dubes and Jain,
1979; Milligan and Isaac, 1980; Smith and Dubes, 1980; Milligan,
1981, 1996) and, if the sampling distributions are known or can be
estimated, it is also possible to randomize the data to obtain
confidence intervals around nodal values (Hartigan, 1977; Strauss,
1982; Morey et al., 1983; Murtagh, 1983). It is possible to apply
the bootstrap to ultrametric trees (Nemec and Brinkhurst, 1988;
Pamilo, 1990) in much the same way that it has been used for
additive trees (Felsenstein, 1985; Penny and Penny, 1990), and this
should become standard procedure. Alternatively, if one assumes
that the data available come from a mixture of several different
populations for which the distributions approximate a standard type
(e.g., multivariate normal), the clustering problem is then
transformed into the problem of estimating, for each of the
samples, the parameters of the assumed distribution and the
probability that an observation comes from that population (Shieh
and Joseph, 1992; Everitt and Dunn, 1992; Banerjee and Rosenfeld,
1993). Strauss (2000) described a similar approach that does not
depend on assumptions about underlying structure. These “k-means”
approaches have the merit of moving the clustering problem away
from ad hoc procedures towards the more usual and productive
statistical framework of parameter estimation and model testing;
they have the disadvantage of not directly providing a hierarchical
framework for understanding patterns of biological variation.

All of these alternatives depend critically either upon assumptions
about the forms of underlying distributions or on the randomized
resampling of empirical distributions. Other reasonable procedures
are to carry out analyses both with and without small samples to
determine if the basic clustering structure changes, or to
superimpose small samples onto the solution provided by larger
samples (Perrin et al., 1994). However, the significance of the
sample-size effect for very small samples is that it is a problem
without a clear solution, except to provoke additional collecting.